Fast Training of Nonlinear Embedding Algorithms

Authors

  • Max Vladymyrov
  • Miguel Á. Carreira-Perpiñán
Abstract

Stochastic neighbor embedding (SNE) and related nonlinear manifold learning algorithms achieve high-quality low-dimensional representations of similarity data, but are notoriously slow to train. We propose a generic formulation of embedding algorithms that includes SNE and other existing algorithms, and study their relation with spectral methods and graph Laplacians. This allows us to define several partial-Hessian optimization strategies, characterize their global and local convergence, and evaluate them empirically. We achieve up to two orders of magnitude speedup over existing training methods with a strategy (which we call the spectral direction) that adds nearly no overhead to the gradient and yet is simple, scalable and applicable to several existing and future embedding algorithms.

We consider a well-known formulation of dimensionality reduction: we are given a matrix of N × N (dis)similarity values, corresponding to pairs of high-dimensional points $y_1, \dots, y_N$ (objects), which need not be explicitly given, and we want to obtain corresponding low-dimensional points $x_1, \dots, x_N \in \mathbb{R}^d$ (images) whose Euclidean distances optimally preserve the similarities. Methods of this type have been widely used, often for 2D visualization, in all sorts of applications (notably, in psychology). They include multidimensional scaling (originating in psychometrics and statistics; Borg & Groenen, 2005) and its variants such as Sammon's mapping (Sammon, 1969), PCA defined on the Gram matrix, and several methods recently developed in machine learning: spectral methods such as Laplacian eigenmaps (Belkin & Niyogi, 2003) or locally linear embedding (Roweis & Saul, 2000), convex formulations such as maximum variance unfolding (Weinberger & Saul, 2006), and nonconvex formulations such as stochastic neighbor embedding (SNE; Hinton & Roweis, 2003) and its variations (symmetric SNE, s-SNE: Cook et al., 2007; Venna & Kaski, 2007; t-SNE: van der Maaten & Hinton, 2008), kernel information embedding (Memisevic, 2006), and the elastic embedding (EE; Carreira-Perpiñán, 2010).

Spectral methods have become very popular because they have a unique solution that can be efficiently computed by a sparse eigensolver, and yet they are able to unfold nonlinear, convoluted manifolds. That said, their embeddings are far from perfect, particularly when the data has nonuniform density or multiple manifolds. Better results have been obtained by the nonconvex methods, whose objective functions better characterize the desired embeddings. Carreira-Perpiñán (2010) showed that several of these methods (e.g. SNE, EE) add a point-separating term to the Laplacian eigenmaps objective. This yields improved embeddings: images of nearby objects are encouraged to project nearby but, also, images of distant objects are encouraged to project far away.

However, a fundamental problem with nonconvex methods, echoed in most of the papers mentioned, has been their difficult optimization. First, they can converge to bad local optima. In practice, this can be countered by using a good initialization (e.g. from spectral methods), by simulated annealing (e.g. adding noise to the updates; Hinton & Roweis, 2003) or by homotopy methods (Memisevic, 2006; Carreira-Perpiñán, 2010). Second, numerical optimization has been found to be very slow.
Most previous work has used simple algorithms, some adapted from the neural net literature, such as gradient descent with momentum and adaptive learning rate, or conjugate gradients. These optimizers are very slow on ill-conditioned problems and have limited the applicability of nonlinear embedding methods to small datasets; hours of training for a few thousand points are typical, which rules out interactive visualization and allows only a coarse model selection. Our goal in this paper is to devise training algorithms that are not only significantly faster but also scale up to larger datasets and generalize over a family of embedding algorithms (SNE, t-SNE, EE and others). We do this not by simply using an off-the-shelf optimizer, but by understanding the common structure of the Hessian in these algorithms and their relation with the graph Laplacian of spectral methods. Thus, our first task is to provide a general formulation of nonconvex embeddings (section 1) and understand their Hessian structure, resulting in several optimization strategies (section 2). We then empirically evaluate them (section 3) and conclude by recommending a strategy that is simple, generic, scalable and typically (but not always) fastest, by up to two orders of magnitude over existing methods. Throughout we write pd (psd) to mean positive (semi)definite, and likewise nd (nsd).

1. A General Embeddings Formulation

Call $X = (x_1, \dots, x_N)$ the d × N matrix of low-dimensional points, and define an objective function

$$E(X; \lambda) = E^+(X) + \lambda E^-(X), \qquad \lambda \ge 0, \qquad (1)$$

where $E^+$ is the attractive term, which is often quadratic psd and minimal with coincident points, and $E^-$ is the repulsive term, which is often nonlinear and minimal when points separate infinitely. Optimal embeddings balance both forces. Both terms depend on X through Euclidean distances between points and thus are shift and rotation invariant. We obtain several important special cases:

Normalized symmetric methods minimize the KL divergence between a posterior probability distribution Q over each point pair, normalized by the sum over all point pairs (where K is a kernel function),

$$q_{nm} = \frac{K(\|x_n - x_m\|^2)}{\sum_{n',m'=1}^N K(\|x_{n'} - x_{m'}\|^2)}, \qquad q_{nn} = 0,$$

and a distribution P defined analogously on the data Y (thus constant wrt X), possibly with a different kernel and width. This is equivalent to choosing

$$E^+(X) = -\sum_{n,m=1}^N p_{nm} \log K(\|x_n - x_m\|^2), \qquad E^-(X) = \log \sum_{n,m=1}^N K(\|x_n - x_m\|^2)$$

and λ = 1 in eq. (1). Particular cases are s-SNE (Cook et al., 2007) and t-SNE, with Gaussian and Student's t kernels, respectively. We will write $p_{nm} = w^+_{nm}$ from now on.

Normalized nonsymmetric methods consider instead per-point distributions $P_n$ and $Q_n$, as in the original SNE (Hinton & Roweis, 2003). Their expressions are more complicated and we focus here on the symmetric ones.

Unnormalized models dispense with distributions and are simpler. For a Gaussian kernel, in the elastic embedding (EE; Carreira-Perpiñán, 2010) we have

$$E^+(X) = \sum_{n,m=1}^N w^+_{nm} \|x_n - x_m\|^2, \qquad E^-(X) = \sum_{n,m=1}^N w^-_{nm}\, e^{-\|x_n - x_m\|^2},$$

where $W^+$ and $W^-$ are symmetric nonnegative matrices (with $w^+_{nn} = w^-_{nn} = 0$, $n = 1, \dots, N$).

Spectral methods such as Laplacian eigenmaps or LLE define $E^+(X) = \sum_{n,m=1}^N w^+_{nm} \|x_n - x_m\|^2$ and $E^-(X) = 0$, with nonnegative affinities $w^+_{nm}$, but add quadratic constraints to prevent the trivial solution X = 0. So $E^+$ is as in EE and SNE.
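To make the special cases above concrete, here is a minimal NumPy sketch of the objective values for EE and for s-SNE with a Gaussian kernel. The function names and the dense-matrix implementation are illustrative assumptions, not the authors' code; X is the d × N matrix of images, Wp/Wm are the attractive/repulsive weight matrices, and P is the data distribution $p_{nm}$ (all assumed symmetric with zero diagonal).

```python
import numpy as np

def pairwise_sqdist(X):
    """Squared Euclidean distances between all columns of the d x N matrix X."""
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(D2, 0.0)
    return np.maximum(D2, 0.0)           # guard against small negative round-off

def ee_objective(X, Wp, Wm, lam):
    """Elastic embedding: E(X; lam) = sum_nm w+_nm d2_nm + lam * sum_nm w-_nm exp(-d2_nm)."""
    D2 = pairwise_sqdist(X)
    return np.sum(Wp * D2) + lam * np.sum(Wm * np.exp(-D2))

def ssne_objective(X, P):
    """Symmetric SNE with Gaussian kernel K(t) = exp(-t) and lam = 1:
    E+(X) = -sum_nm p_nm log K(d2_nm) = sum_nm p_nm d2_nm,
    E-(X) = log sum_{n != m} K(d2_nm)."""
    D2 = pairwise_sqdist(X)
    K = np.exp(-D2)
    np.fill_diagonal(K, 0.0)             # q_nn = 0: exclude the diagonal from the normalization
    return np.sum(P * D2) + np.log(np.sum(K))
```

For t-SNE one would swap the Gaussian kernel for a Student's t kernel, K(t) = 1/(1 + t), in both terms, following the same generic formulation.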
This formulation suggests previously unexplored algorithms, such as using an Epanechnikov kernel, or a t-EE, or using homotopy algorithms for SNE/t-SNE, where we follow the optimal path X(λ) from λ = 0 (where X = 0) to λ = 1. It can also be extended to closely related methods for embedding (kernel information embedding; Memisevic, 2006) and metric learning (neighborhood component analysis; Goldberger et al., 2005), among others.

We express the gradient and Hessian (written as matrices of d × N and Nd × Nd, respectively) in terms of graph Laplacians, following Carreira-Perpiñán (2010), as opposed to the forms used in the SNE papers. This brings out the relation with spectral methods and simplifies the task of finding pd terms. Given an N × N symmetric matrix of weights $W = (w_{nm})$, we define its graph Laplacian matrix as $L = D - W$, where $D = \mathrm{diag}\left(\sum_{n=1}^N w_{nm}\right)$ is the degree matrix. Likewise we obtain $L^+$ from the weights $w^+_{nm}$, $L^q$ from $w^q_{nm}$, etc. L is psd if W is nonnegative, since $u^T L u = \frac{1}{2}\sum_{n,m=1}^N w_{nm}(u_n - u_m)^2 \ge 0$. The Laplacians below always assume summation over points, so that the dimension-dependent Nd × Nd Laplacian (from weights $w_{in,jm}$) is really an N × N Laplacian for each $(i, j)$ pair of dimensions; all other Laplacians are dimension-independent, of size N × N. Using this convention, we have for normalized symmetric models:
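The graph-Laplacian construction and the psd property stated above are easy to verify numerically. Below is a minimal NumPy sketch; the function name and the random test weights are illustrative assumptions, not part of the paper.

```python
import numpy as np

def graph_laplacian(W):
    """Graph Laplacian L = D - W, with degree matrix D = diag(sum_n w_nm)."""
    D = np.diag(W.sum(axis=0))
    return D - W

# Check u^T L u = 1/2 * sum_nm w_nm (u_n - u_m)^2 >= 0 on a random symmetric W.
rng = np.random.default_rng(0)
N = 6
W = rng.random((N, N))
W = 0.5 * (W + W.T)                  # symmetrize
np.fill_diagonal(W, 0.0)             # w_nn = 0
L = graph_laplacian(W)
u = rng.standard_normal(N)
quad = u @ L @ u
identity = 0.5 * np.sum(W * (u[:, None] - u[None, :]) ** 2)
assert np.isclose(quad, identity) and quad >= 0.0   # L is psd for nonnegative W
```

The same constructor gives $L^+$ from $W^+$ or $L^q$ from the corresponding weights; for the attractive term, for instance, $\sum_{n,m} w^+_{nm}\|x_n - x_m\|^2 = 2\,\mathrm{tr}(X L^+ X^T)$, which is how the quadratic terms above connect to spectral methods.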


Publication date: 2012 (Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK).